In contextual linear bandits, the reward function is assumed to be a linear combination of an unknown reward vector and a given embedding of context-arm pairs. In practice, the embedding is often learned at the same time as the reward vector, which leads to an online representation learning problem. Existing approaches to representation learning in contextual bandits are either very generic (e.g., model-selection techniques or algorithms for learning with arbitrary function classes) or specialized to particular structures (e.g., nested features or representations with certain spectral properties). As a result, the understanding of the cost of representation learning in contextual linear bandits is still limited. In this paper, we take a systematic approach to the problem and provide a comprehensive study through an instance-dependent perspective. We show that representation learning is fundamentally more complex than linear bandits (i.e., learning with a given representation). In particular, learning with a given set of representations is never simpler than learning with the worst realizable representation in the set, and we show cases where it can be arbitrarily harder. We complement this result with an extensive discussion of how it relates to the existing literature, and we illustrate positive instances where representation learning is as complex as learning with a fixed representation and where sub-logarithmic regret is achievable.
translated by 谷歌翻译
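The fixed-representation baseline that the first abstract compares against (a linear bandit with a given embedding of context-arm pairs) can be sketched with a LinUCB-style learner. This is a minimal illustration rather than the paper's method: the class name `LinUCB`, the parameters `reg` and `beta`, the helper functions, and the toy features and reward vector are all assumptions introduced here.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in M]


class LinUCB:
    """Ridge-regression UCB for a fixed, known embedding phi(x, a) (illustrative)."""

    def __init__(self, dim, reg=1.0, beta=1.0):
        # A_inv is the inverse of the design matrix reg * I + sum_t phi_t phi_t^T.
        self.A_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward_t * phi_t
        self.beta = beta      # width of the confidence bonus (assumed constant)

    def theta_hat(self):
        """Ridge estimate of the unknown reward vector."""
        return matvec(self.A_inv, self.b)

    def select(self, features):
        """Return the index of the arm with the largest upper confidence bound."""
        theta = self.theta_hat()

        def ucb(phi):
            width = math.sqrt(max(dot(phi, matvec(self.A_inv, phi)), 0.0))
            return dot(theta, phi) + self.beta * width

        return max(range(len(features)), key=lambda a: ucb(features[a]))

    def update(self, phi, reward):
        """Rank-one Sherman-Morrison update of A_inv, then update b."""
        Av = matvec(self.A_inv, phi)
        denom = 1.0 + dot(phi, Av)
        dim = len(phi)
        for i in range(dim):
            for j in range(dim):
                self.A_inv[i][j] -= Av[i] * Av[j] / denom
        for i in range(dim):
            self.b[i] += reward * phi[i]


# Toy run: one fixed context, three arms, linear rewards with Gaussian noise.
random.seed(0)
theta_star = [1.0, -0.5]                      # hypothetical unknown reward vector
arms = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # hypothetical embeddings phi(x, a)
learner = LinUCB(dim=2)
for _ in range(200):
    a = learner.select(arms)
    reward = dot(theta_star, arms[a]) + random.gauss(0.0, 0.1)
    learner.update(arms[a], reward)
print(learner.theta_hat())
```

In the representation-learning problem the abstract studies, `arms` would not be given: the learner would hold a set of candidate embeddings and would have to identify a realizable one while acting, which is the extra cost the paper quantifies.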
Active learning with strong and weak labelers considers a practical setting where we have access to both costly but accurate strong labelers and inaccurate but cheap predictions provided by weak labelers. We study this problem in the streaming setting, where decisions must be taken \textit{online}. We design a novel algorithmic template, Weak Labeler Active Cover (WL-AC), that robustly leverages the lower-quality weak labelers to reduce the query complexity while retaining the desired level of accuracy. Prior active learning algorithms with access to weak labelers learn a difference classifier that predicts where the weak labels differ from the strong labels; this requires the strong assumption that the difference classifier is realizable (Zhang and Chaudhuri, 2015). WL-AC bypasses this \textit{realizability} assumption and is thus applicable to many real-world scenarios, such as randomly corrupted weak labels and high-dimensional families of difference classifiers (\textit{e.g.}, deep neural nets). Moreover, WL-AC trades off evaluating the quality of the weak labelers against fully exploiting them, which makes it possible to convert any active learning strategy into one that can leverage weak labelers. We provide an instantiation of this template that achieves the optimal query complexity for any given weak labeler, without knowing its accuracy a priori. Empirically, we propose an instantiation of the WL-AC template that can be efficiently implemented for large-scale models (\textit{e.g.}, deep neural nets) and show its effectiveness on the corrupted-MNIST dataset, where it significantly reduces the number of labels while keeping the same accuracy as passive learning.
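We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased \emph{evaluations} of the true reward of each arm, and selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that, at each round, the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a $\widetilde{O}(T^{2/3})$ regret in the general case, when the observation functions are generalized linear functions of the true rewards. On the other hand, when the observation functions are noisy linear functions of the true rewards, we derive an improved $\widetilde{O}(\sqrt{T})$ regret. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison with alternative approaches, and further supports the practical interest of this setting.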
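This paper studies privacy-preserving exploration in Markov Decision Processes (MDPs) with linear representation. We first consider the setting of linear-mixture MDPs (Ayoub et al., 2020) (a.k.a. the model-based setting) and provide a unified framework for analyzing joint and local differentially private (DP) exploration. Through this framework, we prove a $\widetilde{O}(K^{3/4}/\sqrt{\epsilon})$ regret bound for $(\epsilon,\delta)$-local DP exploration and a $\widetilde{O}(\sqrt{K/\epsilon})$ regret bound for $(\epsilon,\delta)$-joint DP. We further study privacy-preserving exploration in linear MDPs (Jin et al., 2020) (a.k.a. the model-free setting), where we provide a $\widetilde{O}(\sqrt{K/\epsilon})$ regret bound for $(\epsilon,\delta)$-joint DP, with a novel algorithm based on low switching. Finally, we provide insights into the issues of designing local DP algorithms in this model-free setting.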
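We introduce a generic strategy for efficient multi-goal exploration. It relies on AdaGoal, a novel goal selection scheme based on a simple constrained optimization problem, which adaptively targets goal states that are neither too difficult nor too easy to reach according to the agent's current knowledge. We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy for every goal state that is reachable within $L$ steps in expectation from a reference state $s_0$ in a reward-free Markov decision process. In the standard tabular case, our algorithm requires $\tilde{O}(L^3 S A \epsilon^{-2})$ exploration steps, which is nearly minimax optimal. We also readily instantiate AdaGoal in linear mixture Markov decision processes, which yields the first goal-oriented PAC guarantee with linear function approximation. Beyond its strong theoretical guarantees, AdaGoal is anchored in the high-level algorithmic structure of existing methods for goal-conditioned deep reinforcement learning.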
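We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence guarantees and sample complexities for vanilla policy gradient (PG) methods, namely REINFORCE and GPOMDP. Our only assumptions are that the expected return is smooth w.r.t. the policy parameters and that the second moment of its gradient satisfies a certain \emph{ABC assumption}. The ABC assumption allows the second moment of the gradient to be bounded by $A \geq 0$ times the suboptimality gap, $B \geq 0$ times the norm of the full-batch gradient, and an additive constant $C \geq 0$, or any combination of the above. We show that the ABC assumption is more general than the assumptions on the policy space commonly used to prove convergence to a stationary point. We provide a single convergence theorem under the ABC assumption and show that, despite the generality of the ABC assumption, we recover the $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity of PG. Our convergence theorem also affords greater flexibility in the choice of hyperparameters such as the step size, and it places no restriction on the batch size $m$. Even the single-trajectory case (i.e., $m = 1$) fits within our analysis. We believe that the generality of the ABC assumption may provide theoretical guarantees for PG on a much broader range of problems that have not been previously considered.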
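Written out, the ABC assumption described in this abstract takes roughly the following form. This is a sketch under stated assumptions: the symbols $J$ (expected return), $J^\star$ (its optimal value), $\theta$ (policy parameters) and $\widehat{\nabla} J$ (the stochastic gradient estimate) are notation introduced here, and the factor $2$ and the squared norm follow the usual convention for ABC-type conditions in SGD analysis rather than anything stated in the abstract.

```latex
\mathbb{E}\left[\left\|\widehat{\nabla} J(\theta)\right\|^{2}\right]
\;\le\; 2A\left(J^{\star} - J(\theta)\right)
      + B\left\|\nabla J(\theta)\right\|^{2}
      + C,
\qquad A, B, C \ge 0 .
```

Under this reading, taking $A = B = 0$ reduces the condition to a uniformly bounded second moment, while $A = 0$ alone gives a bound that grows with the full-batch gradient, which illustrates how the three constants can be combined.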
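We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm, EB-SSP, that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EB-SSP achieves the minimax regret rate $\tilde{O}(B_{\star} \sqrt{SAK})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it requires no prior knowledge of $B_{\star}$, nor of $T_{\star}$, which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., when an order-accurate estimate of $T_{\star}$ is available) where the regret contains only a logarithmic dependence on $T_{\star}$, thus yielding the first (nearly) horizon-free regret bound beyond the finite-horizon MDP setting.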
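One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment against the exploitation of the samples to optimize its behavior. Whether we optimize for regret, sample complexity, state-space coverage, or model estimation, we need to strike a different exploration-exploitation trade-off. In this paper, we propose to tackle the exploration-exploitation problem with a decoupled approach composed of: 1) an "objective-specific" algorithm that (adaptively) prescribes how many samples to collect at which states, as if it had access to a generative model (i.e., a simulator of the environment); 2) an "objective-agnostic" sample collection exploration strategy responsible for generating the prescribed samples as fast as possible. Building on recent methods for exploration in the stochastic shortest path problem, we first provide an algorithm that, given as input the number of samples $b(s,a)$ needed in each state-action pair, requires $\tilde{O}(BD + D^{3/2} S^2 A)$ time steps to collect the $B = \sum_{s,a} b(s,a)$ desired samples, in an MDP with $S$ states, $A$ actions, and diameter $D$. We then show how this general-purpose exploration algorithm can be paired with "objective-specific" strategies that prescribe the sample requirements for a variety of settings (e.g., model estimation, sparse reward discovery, and goal-free cost-free exploration in communicating MDPs), for which we obtain improved or novel sample complexity guarantees.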
Computational units in artificial neural networks follow a simplified model of biological neurons. In the biological model, the output signal of a neuron runs down the axon, splits along the many branches at its end, and passes identically to all the downward neurons of the network. Each downward neuron uses its copy of this signal as one of many dendritic inputs, integrates them all, and fires an output if the result is above some threshold. In the artificial neural network, this means that the nonlinear filtering of the signal is performed in the upward neuron, so that in practice the same activation is shared by all the downward neurons that use that signal as their input. Dendrites thus play a passive role. We propose a slightly more complex model for the biological neuron, where dendrites play an active role: the activation at the output of the upward neuron becomes optional, and instead the signals going through each dendrite undergo independent nonlinear filterings before the linear combination. We implement this new model in a ReLU computational unit and discuss its biological plausibility. We compare this new computational unit with the standard one and describe it from a geometrical point of view. We provide a Keras implementation of this unit in fully connected and convolutional layers and estimate the resulting change in FLOPs and number of weights. We then use these layers in ResNet architectures on CIFAR-10, CIFAR-100, Imagenette, and Imagewoof, obtaining performance improvements over standard ResNets of up to 1.73%. Finally, we prove a universal representation theorem for continuous functions on compact sets and show that this new unit has more representational power than its standard counterpart.
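The contrast between the standard unit and the proposed dendritic unit can be sketched in a few lines of plain Python. This is one possible reading of the abstract, not the authors' Keras implementation: the function names, the per-dendrite biases `dendrite_bias`, and the exact placement of weights relative to the ReLU are assumptions made for illustration.

```python
def relu(z):
    """Rectified linear unit for a scalar."""
    return z if z > 0.0 else 0.0


def standard_unit(x, w, bias):
    """Standard unit: linear combination first, then one shared nonlinearity."""
    return relu(sum(wi * xi for wi, xi in zip(w, x)) + bias)


def dendritic_unit(x, w, dendrite_bias, bias=0.0):
    """Hypothetical dendritic unit: each input passes through its own ReLU
    (with its own bias) before the linear combination; no output activation."""
    filtered = [relu(xi + bi) for xi, bi in zip(x, dendrite_bias)]
    return sum(wi * fi for wi, fi in zip(w, filtered)) + bias


x = [1.0, -2.0, 0.5]   # incoming signals (toy values)
w = [0.5, 1.0, -1.0]   # connection weights (toy values)
print(standard_unit(x, w, bias=0.1))
print(dendritic_unit(x, w, dendrite_bias=[0.0, 1.0, 0.5]))
```

In the standard unit every downstream neuron sees the same activated signal; in this sketch each connection applies its own threshold, so the same input can be suppressed on one dendrite and passed on another.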
Fruit is a key crop in worldwide agriculture, feeding millions of people. The standard supply chain of fruit products involves quality checks to guarantee freshness, taste, and, most of all, safety. An important factor that determines fruit quality is its stage of ripening. This is usually classified manually by experts in the field, which makes it a labor-intensive and error-prone process. Thus, there is a growing need for automation in fruit ripeness classification. Many automatic methods have been proposed that employ a variety of feature descriptors for the food item to be graded. Machine learning and deep learning techniques dominate the top-performing methods. Furthermore, deep learning can operate on raw data and thus relieves users of having to compute complex engineered features, which are often crop-specific. In this survey, we review the latest methods proposed in the literature to automate fruit ripeness classification, highlighting the most common feature descriptors they operate on.